DeepGenePrior: A deep learning model for prioritizing genes affected by copy number variants

The genetic etiology of brain disorders is highly heterogeneous, characterized by abnormalities in the development of the central nervous system that lead to diminished physical or intellectual capabilities. The process of determining which gene drives disease, known as “gene prioritization,” is not entirely understood. Genome-wide searches for gene-disease associations are still underdeveloped due to reliance on previous discoveries and evidence sources with false positive or negative relations. This paper introduces DeepGenePrior, a model based on deep neural networks that prioritizes candidate genes in genetic diseases. Using the well-studied Variational AutoEncoder (VAE), we developed a score to measure the impact of genes on target diseases. Unlike other methods that use prior data to select candidate genes, based on the "guilt by association" principle and auxiliary data sources like protein networks, our study exclusively employs copy number variants (CNVs) for gene prioritization. By analyzing CNVs from 74,811 individuals with autism, schizophrenia, and developmental delay, we identified genes that best distinguish cases from controls. Our findings indicate a 12% increase in fold enrichment in brain-expressed genes compared to previous studies and a 15% increase in genes associated with mouse nervous system phenotypes. Furthermore, we identified common deletions in ZDHHC8, DGCR5, and CATG00000022283 among the top genes related to all three disorders, suggesting a common etiology among these clinically distinct conditions. DeepGenePrior is publicly available online at http://git.dml.ir/z_rahaie/DGP to address obstacles in existing gene prioritization studies identifying candidate genes.

Research on the genetics of diseases has implications for diagnosing, treating, and developing drugs for these disorders. Understanding the genetic etiology of brain disorders can provide valuable insights into effective prevention and treatment methods. Gene prioritization, the process of identifying genes that most likely contribute to a disease or phenotype, can be used in BDs. This work uses case and control copy number variants as input to prioritize causal genes associated with BDs.
The prioritization of genes relies on various types of evidence. According to [12], gene-disease associations are grouped into five categories, namely functional, cross-species, same-compartment, mutation, and textual. The first category examines molecule interactions [13], while the second category discusses homolog genes that cause similar phenotypes in other organisms [14]. Same-compartment evidence is based on the fact that the gene is involved in known disease-associated pathways or compartments, such as the cell membrane or nucleus [15]. Mutation evidence is based on Single Nucleotide Polymorphism (SNP) and structural variants, which is also the focus of this study [16]. Text evidence can be obtained from online collections like PubMed [17].
Several gene prioritization methods have been reviewed in [18][19][20][21], and from a methodological point of view, they can be classified into statistical and machine learning methods. The first group primarily employs hypothesis testing, such as exact tests like Fisher's or permutation tests, to determine whether a gene is associated or not. However, several studies have reported p-value fallacies, such as distributional assumptions, limitations in data collection, and misleading results [22]. In addition, power loss and dependent values are discussed in detail as other criticisms of marginal p-values in [3]. Other issues can arise with these types of analyses, such as not considering all the heterogeneous features of genes.
Machine learning (ML) methods often rely on the 'guilt by association' (GBA) principle [23][24][25]. This principle suggests that the new genes associated with a disease interact with the most recently discovered genes in a network that encodes similarities between genes. Inference of different types of networks can then lead to the discovery of new genes. In other words, ML methods require seed data (in this case, genes that implicitly characterize the disorder) [18] and a similarity metric to determine which candidate genes are similar or associated with the seed(s). However, issues arise with this approach, as discussed in [23,24]. For instance, it is impossible to discover a novel gene association that does not relate to the previous ones. Additionally, genes of a novel genetic disease with unknown roots cannot be found [26,27] due to the dependency of these methods on prior information.
The issues discussed above hinder an ideal gene prioritization solution. To overcome these issues, we propose the DeepGenePrior method, which falls in the fourth category suggested in [12], as a deep learning architecture for gene prioritization. DeepGenePrior uses the well-studied autoencoder architecture with a variational learning framework. The Variational AutoEncoder [28,29] (VAE) is the stochastic variant of the autoencoder. Our method uses Copy Number Variants (CNV) data for gene prioritization. We train the network of neurons with all CNVs of cases and controls for all three diseases, followed by fine-tuning with the CNVs of the target disease. Controls and cases have zero and one CNV labels for the supervised learning phase. Finally, we build a score for every gene using the network weights and prioritize them. Our proposed method addresses gaps in previous studies and offers several advantages. First, it does not rely on theoretical assumptions like those in the hypothesis testing. Second, it does not require seed data, which is needed for methods based on guilt by association. Third, it does not rely on networks with false relations, like protein-protein networks.
We used CNVs from brain disorders to evaluate our method and compared them against major tools. We identified significantly mutated genes and found that our method detects genes that are 12% more enriched in brain expression than other tools. Furthermore, we compared the detected genes to those that cause nervous system phenotypes in mice and found our results to be 15% more enriched than other methods.
In addition, we examined genes that were exclusively overrepresented in one gender and analyzed the relationships between the detected genes and various phenotypes in the DECI-PHER data source and the gene ontology of the putative genes. We found three genes common among the top genes associated with all three diseases: ZDHHC8, DGCR5, and CATG00000022283. According to the literature [30], defects found in ZDHHC8 can be linked to susceptibility to schizophrenia. Also, we found that deletions in CYFIP1, PRODH, XXBAC, B444P24, LINC00896, ZDHHC8, AC006547, NIPA2, RTN4R, NIPA1, and TUBGCP5 are associated with schizophrenia and developmental delay.
The following section describes our algorithm, the data we used, and the experiments we conducted. We then discuss our results before presenting our conclusions and future work in the final section.

Prioritization of the genes in BDs
A deep learning model was utilized to identify the genes associated with BDs. The model was trained using copy number variants (CNVs) from all cases and controls, and the resulting model weights were employed to determine scores. The UCSC Lift Genome Annotations [31] tool was employed to convert all CNVs to the hg19 genome, and the locations of all CNVs were confirmed using NCBI remap tools [32]. CNVs smaller than one kilobase pair were excluded from the analysis.
The study showcases Tables 1, 2, and 3, displaying the top 40 genes for each disorder, accompanied by their respective p-values. Table 4 illustrates the methodology employed for Fisher's exact test. Specifically, CaseOV represents the overlaps between cases and the genes, while ControlOV represents the overlaps between controls and the genes.
The study presents Tables 1, 2, and 3, which exhibit the top 40 genes for each disorder along with their corresponding p-values. Table 4 shows the formulation of Fisher's exact test. CaseOV represents the number of overlaps between cases and the genes, while controlOV represents the number of overlaps between controls and the genes. Furthermore, we examined the genes that are associated with all of the three disorders and those linked with only two of them. COMT deletion is common between ASD and SCZ, while deletions in CYFIP1, PRODH, XXBAC-B444P24, LINC00896, ZDHHC8, AC006547, NIPA2, RTN4R, NIPA1, and TUBGCP5 are common between SCZ and DD. Next, common genes between ASD and DD are deletions in FAM57B, SHANK3, and BDH1, and the shared genes between the three disorders were deletions in DGCR5 and ZDHHC8. . A deep learning model learns the distinctions between cases and controls; then, the learned weights are used to prioritize the genes. After training, the results are evaluated using mutant mouse genes, human brain-enriched genes, DECIPHER data, and gene ontology analyses.
https://doi.org/10.1371/journal.pcbi.1011249.g001 In the subsequent sections, a comparison was made with machine learning methods, followed by a search for genes displaying brain-enriched expression. Notably, it was observed that many genes associated with brain disorders possess brain-enriched functions [33]. We compared our results with similar studies, demonstrating that our research successfully identifies more brain-enriched genes than previous investigations. Furthermore, we compare our findings to genes that cause nervous system phenotypes in mice, which were obtained from the MGI repository [34]. Our study demonstrates a higher fold enrichment than similar studies. The next step is identifying genotype-phenotype relationships using the DECIPHER data source [35], focusing on phenotypes exhibiting high enrichment levels.

PLOS COMPUTATIONAL BIOLOGY
In addition, we used WebGestalt [36] to perform gene ontology analyses of coding genes, with a focus on examining Gene Ontology (GO), Human Phenotype Ontology (HPO), and associated disease terms.

Comparison with machine learning methods
Next, we compare our method with machine learning methods for the gene prioritization problem. The selected algorithms were guided backpropagation (GBP) [37], deepLIFT [38], and DeepGenePrior (without pre-training). The third choice is to show the effect of pre-training on the performance of the whole model (an ablation study). DeepLIFT [38] is a referencebased global feature importance algorithm that uses a correlation score to measure the input's effect on the model's output. Guided backpropagation is a global feature importance that is gradient-based.
The performance benchmarks are computed as follows. The model is trained comprehensively, and important genes are selected based on their respective weights. Subsequently, the model is retrained using the identified important genes as inputs and the disease status as the output. The performance evaluation is then reported based on the test set. Global methods mainly suffer from many computations and estimates (making the model inaccurate). Deep-LIFT needs a reference for calculation; the reference is very influential in the final results of the model and may cause the model to choose the wrong inputs.
Guided backpropagation needs gradients, and it has been proven that the gradients can sometimes be noisy, resulting in the selection of irrelevant features. Other methods need several simple local surrogate models to interpolate the manifold in high-dimensional models (like LIME [39]); these surrogates impose massive calculations and imprecise the model. Some advantages of our proposed method are that it does not need the reference, does not rely on noisy data, and is not local, and there is a way to inject unlabeled as well as labeled data in the model.
The Python torch Captum [40] implementations of these algorithms were used for the comparison.
The results were reported in Table 5 regarding the accuracy and ROC AUC. Our DeepGen-ePrior algorithm Performs higher than the others. Besides, ROC curves are shown in Fig 2.  The yellow curve represents DeepGenePrior, the blue curve represents DeepGenePrior without Pretraining, the green curve represents Guided Backpropagation [37], and the Red is for DeepLIFT [38]. https://doi.org/10.1371/journal.pcbi.1011249.g002

PLOS COMPUTATIONAL BIOLOGY
DeepGenePrior: A deep learning model for prioritizing genes affected by CNVs

Overrepresentation of tissue-specific genes
Several studies (such as [41] and [42]) claim that brain-enriched genes play an important role in BDs. To determine whether the detected genes are overrepresented in the brain tissue, we compute the fraction of coding and non-coding genes that have been enriched (background expectation) and compare it with the percentage of genes that have overlapped with deleted or duplicated CNVs. The authors of [41] provide a list of brain-enriched genes. To obtain this list, they used the FANTOM5 CAGE-associated transcriptome [43] to identify coding and long non-coding RNA genes in the regions and examined their expression patterns across sample types.
In addition to alternative methods, we incorporated two gene prioritization tools, Gene-Friends [44] and ToppGene [45], both accessible online. GeneFriends applies the guilt by association approach, while ToppGene identifies candidate genes based on functional similarity to the training gene list. However, these tools possess certain limitations. Notably, they have a restricted capacity for accommodating large datasets, require seed data for achieving results (following the guilt by association principle), and rely on parameter tuning by the user, such as setting a Pearson correlation threshold and an FDR threshold. For this analysis, the default parameter values were employed. The results of our study are compared with those of Coe et al. [11] and Cooper et al. [46], two important studies of developmental delay. They were also compared with PLINK [47] and SNATCNV [41], publicly available tools with state-of-the-art performance.
In the list of brain-enriched genes related to ASD and SCZ, DGCR2 specifies a protein proposed to be important in neural crest cell migration [30]. The ZDHHC8 gene, strongly associated with ASD and SCZ [30], is another gene to note.

PLOS COMPUTATIONAL BIOLOGY
Next, we have some brain-enriched genes associated with SCZ; RTN4R is a gene in which adult central nervous systems are likely to be affected by its role in regulating axonal regeneration and plasticity. CATG00000058203 and Septin5 and CATG00000057131 are some brainenriched genes associated with ASD and SCZ, previously mentioned in [41].
As for the developmental delay, the DGCR5, PRODH, NIPA1, TUBGCP5, RTN4R, ZDHHC8, CRKL, and SERPIND1 genes are also brain-enriched and associated with the disease. Most of them are from the 22 nd and 15 th chromosomes (22q11.21).

Gene segregation analysis of male and female patients
Long-standing research shows that females are more tolerant of mutations than males, which explains why males are more prone to brain disorders such as autism. New studies also confirm the validity of previous findings [48][49][50] that male cases show more significant enrichment than female cases when comparing the ratios of cases to controls. In this research, we pointed out that some genes are more biased towards males, for example, deletion in PHF2 (ENSG00000197724), duplication in NRXN1 (ENSG00000179915), and deletions in WDFY3 (ENSG00000163625), PHF3 (ENSG00000118482), MED13L (ENSG00000123066), and WAC (ENSG00000095787), are more frequently seen in males than females for the developmental delay.
Besides, we performed the same analysis with ASD CNVs. We found that the PTCHD1 (ENSG00000165186) gene deletion occurred more in male than female patients. (Table 6 provides the details for these claims, and the chi-square test confirms the results).
Regarding the developmental delay, secondary conditions such as microcephaly (HP:0000252) and anxiety (HP:0000739) can be proposed, which was also suggested in [53]; This disorder has received less research. BCL9, FMO5, and GPR89B deletions related to microcephaly are also overrepresented in DD. NIPA1 duplication, associated with anxiety, is among the top genes of DD. In [53], microdeletion of the NF1 gene is found to be associated with microcephaly and DD.
Our model deduces a set of genes for a target genetic disease. We investigated the set of phenotypes related to the genes; the specific relationship between genes and phenotypes shows that there can be diversity in the etiology of the disease, which implies that the occurrence of a phenotype in a target disease is influenced by what candidate genes are mutated in the patient.

Analysis of biological processes and phenotypic ontologies of candidate genes
As part of our analysis, we used WebGestalt [36] to investigate the associations between identified genes and specific gene ontologies (GOs), human phenotype ontologies (HPOs), and disease terms [54,55].
Some examples of the discovered disease ontology terms were intellectual disability, language development disorders, poor school performance (for the developmental delay), autistic

PLOS COMPUTATIONAL BIOLOGY
behavior. Abnormal behavior is mentioned in [57], and impaired social interaction is discussed in [56] as phenotypes related to BDs.
The highlighted Gene Ontology terms include cognition, dendrite development, and synapse organization. In [58], dendrite development is pointed out to be associated with BDs, and the relationship between synapse organization and BDs is addressed in [59]. Tables 7, 8, and 9 summarize the results.

Overrepresentation of homologs of coding genes causative of nervous system phenotypes in the mutated mouse
Studying animal genetic mutations provides insight into disease mechanisms and treatments for brain disorders. Several animal models have been developed to uncover the disorder's process [60]. Mutant mice with specific defects in the nervous system are among them. Models based on mutant mice replicate key symptoms of brain disorders.
We investigate what percentage of the causative genes have homologs in mouse genes whose mutation causes nervous system phenotypes. For this purpose, we used the Mouse

PLOS COMPUTATIONAL BIOLOGY
Genome Informatics (MGI) database to identify genes related to the mouse nervous system and their human homologs. Fig 8 presents an analysis of the proportions of homologs among the identified genes exhibiting nervous system phenotypes in mice. The findings reveal that the coding genes identified by our method show a higher percentage of homologs in mutant mouse models displaying nervous system phenotypes compared to the results obtained from other methods. We also evaluated two gene prioritization tools, GeneFriends [44] and ToppGene [45]. For example, some genes that have orthologs in mice with nervous system phenotypes are SEPTIN5, RTN4R, and ZDHHC8. These genes are common among the three disorders.

Statistical analysis
Subsequently, an independent statistical analysis is conducted to compare the outcomes of DeepGenePrior with hypothesis testing performed in similar studies. Sample results are depicted in Fig 9. To assess whether the observed associations are statistically significant or occur by chance, 100,000 random permutations of case and control labels were performed. The corresponding results are illustrated in the respective diagrams.

Discussion
In this paper, we presented a deep learning approach that uses a variational autoencoder to analyze CNVs and prioritize genes within them systematically. Our deep learning model learns how features are distributed over samples, which enables us to predict the likelihood that a gene variation will cause a specific disease. We applied our method to three disorders under the umbrella term of brain disorders. We examined the results for overrepresentation of enriched brain coding, long non-coding RNA genes, and mouse orthologs with nervous system

PLOS COMPUTATIONAL BIOLOGY
potential candidates for involvement in DiGeorge syndrome pathology and schizophrenia. Additionally, the expression of MVP may serve as a prognostic marker for several types of cancer. For schizophrenia, DGCR6 and PRODH are well-known candidate genes, and DGCR5 is a long non-coding RNA gene with a high score for causing schizophrenia. Schizophrenia is a complex and debilitating mental disorder associated with genetic factors. One of the well-known candidate genes for schizophrenia is DGCR6, which codes for a protein. In addition, mutations in PRODH (Proline Dehydrogenase 1), located on 22q11.21, have been linked with susceptibility to schizophrenia (SCZD4). Another potential genetic factor for schizophrenia is DGCR5, which is a long non-coding RNA (lncRNA) with a high score for causing schizophrenia. [30] SEZ6L2 (Seizure Related 6 Homolog Like 2), located in 16p11.2, is another gene implicated in mental disorders. This region is thought to hold candidate genes for autism spectrum disorder. CDIPTOSP (CDIP Transferase Opposite Strand, Pseudogene) is a lncRNA gene associated with Central Nervous System Germ Cell Tumor disease. ASPHD1 (Aspartate Beta-Hydroxylase Domain Containing 1) is another gene linked with schizophrenia (specifically, Schizophrenia 3). Lastly, RANBP1 (RAN Binding Protein 1) is a protein-coding gene linked with Digeorge Syndrome.
Autism spectrum disorder (ASD) is a complex developmental disorder linked to genetic factors. One such factor is the deletion of DGCR2, which has been associated with a wide range of developmental defects. These defects are collectively called CATCH 22, which stands for DiGeorge syndrome, velocardiofacial syndrome, conotruncal anomaly face syndrome, and isolated conotruncal cardiac defects. Additionally, the ARVCF gene is responsible for autosomal dominant Velo-Cardio-Facial syndrome (VCFS), which is characterized by cleft palate, conotruncal heart defects, and facial dysmorphology. GNB1L is another gene that is deleted in DiGeorge syndrome. [61,62] Schizophrenia and panic disorder are two other mental disorders that have been linked to genetic factors. One such factor is the COMT (Catechol-O-Methyltransferase) gene, which codes for a protein and has been associated with both schizophrenia and Panic Disorder 1. Another gene linked to schizophrenia is ZDHHC8 (Zinc Finger DHHC-Type Palmitoyltransferase 8), which is located on chromosome 6q24-q25.
We also investigated the gender distribution of CNVs in BDs. We found that duplication in NRXN1 and deletion in PTCHD1 are more frequently observed in males than females for some of the BDs.

PLOS COMPUTATIONAL BIOLOGY
We observed that some brain-enriched coding genes were significantly expressed in all three disorders. Examples include DGCR2, SEPTIN5, and ARVCF, which are located on chromosome 22 and have a deletion associated with these disorders. These three genes were among the top ten coding brain-enriched genes related to the three disorders. We also found that DGCR5, a noncoding brain-enriched gene previously known as a biomarker for Huntington's disease, is highly associated with DD. AC000068 is a noncoding brain-enriched gene associated with SCZ and ASD. SEPTIN5 has been previously shown to be associated with ASD and SCZ, while DGCR2 is mainly known to be associated with SCZ. AC004471 is a noncoding brain gene among the top 10 genes related to SCZ, ASD, and DD.
Among the top genes with significant brain expression, some have orthologs in mice that showed nervous system phenotypes. SEPTIN5, ZDHHC8, RTN4R, and KCTD13 are the top genes for ASD and SCZ, while RTN4R and ZDHHC8 also rank highly in DD. SEZ6L2 is top in ASD but has a lower rank in SCZ. ZDHHC8 and RTN4R are genes with nervous system morphological and physiological phenotypes, while SEPTIN5 shows only nervous and physiological phenotypes in mice.
In the next step, we used DECIPHER [35] to examine the relationship between the detected genes and other phenotypes. We found that delayed speech, language, and autism were associated with the set of genes. According to our findings, seizures were associated with SCZ; this relationship was previously discussed in [63].
Microcephaly [64] and macrocephaly [65] are two reverse phenotypes associated with ASD and SCZ. Additionally, 'abnormal facial shape' is associated with all three disorders [66], which has also been studied in [67]. CACNA1H is one of the genes related to some overrepresented phenotypes [68], and TCF20, discussed in [69], is another gene highlighted in the heatmap of developmental delay.
We performed gene ontology analysis for the detected genes using the WebGestalt tool. This tool allowed us to perform gene ontology analysis, human phenotype ontology (HPO) analysis, and disease ontology analysis separately. For the disease ontology, some terms were "Language Development Disorders," "Autistic behavior," and "Congenital neck anomaly." Overrepresented HPO terms included "Severe global developmental delay," "abnormal social behavior," "Delayed speech and language development," and "Intellectual disability." Some of the most common gene ontology terms were "dendrite development," "cognition," and "Regulation of Neurological System Process." In summary, these findings support the biological relevance of the method-identified genes to genetic factors that contribute to brain disorders.
Although the application of our model focused on three specific brain disorders, it is important to note that our method is not limited to these disorders alone. The versatility of our approach allows for its application in any case-control study involving copy number variants associated with different target disorders. Consequently, the method inherently generates a list of candidate genes specific to the target disorder.
In future research, we plan to explore integrating network analysis techniques and combining mutational data with other auxiliary information, such as proteins or other modalities. This integration will enable the utilization of various modeling tools, such as graphs, to uncover additional patterns within the mutational data.

Data and preprocessing
In this study, we analyzed three case-control datasets comprising individuals with brain disorders, namely autism spectrum disorder, schizophrenia, and developmental delay. After preprocessing and quality control, the autism spectrum disorder dataset consisted of 47,119 cases and 24,858 control copy number variants (CNVs), as documented in the AUTDB database [41]. The schizophrenia dataset comprised 42,046 cases and 40,414 control CNVs [70]. The developmental delay dataset included 29,803 cases and 11,256 control CNVs. These datasets were selected based on their relevance to the genetic etiology of brain disorders and the availability of reliable and well-curated CNV data.
The final data source for developmental delay comprised two independent datasets with two different data types: NSTD 54 [46] and NSTD 100 [11]. We utilized the NSTD 100 dataset, which includes gender data. All CNVs in this dataset are rare, with a frequency of less than 1% of the population. Further details regarding these CNVs are reported in Table 10.
We used two supplementary data sources in our study. The first is the FANTOM5 (Functional Annotation of the Mammalian Genome 5) Atlas [71], which includes 21,069 proteincoding and 27,920 non-coding genes.
The second data source we used is the Database of Chromosomal Imbalance and Phenotype in Humans Using ENSEMBL Resources (DECIPHER, February 1st, 2017) [35]. This dataset contains information on patients, CNVs, and phenotypes such as ASD, DD, and SCZ. We investigate DECIPHER website to analyze the relationship between genes and other phenotypes and to augment and pretrain our system. Table 11 shows the statistics of the dataset.
This paper also analyzed tissue-enriched genes with a high brain expression level compared to other tissues. We utilized the list of brain-enriched genes provided in [41]. While [42] highlights the impact of brain-enriched genes on autism spectrum disorder, our study focuses on their effect on schizophrenia and developmental delay.

PLOS COMPUTATIONAL BIOLOGY
Additionally, we used MGI (Mouse Genome Informatics) data [34] to determine if candidate genes related to disease cause a nervous system phenotype in mice, following a similar approach as [41]. HTML was parsed from pages covering nervous system phenotype (MP:0003631) [72], abnormal nervous system morphology (MP:0003632) [73], and abnormal nervous system physiology (MP:0003633) [74]. The mapping was performed using [75]. The data preprocessing involved CNV filtering, conversion, and supplementary data cleansing (DECIPHER data analysis, FANTOM5 data, etc.).
For CNV filtering and conversion, we filtered out CNVs smaller than one kbps (similar to other studies such as [11,41,46]). The CNV studies also had different coordinates (hg17, hg18, and hg19). Therefore, we unified all CNVs and converted them to hg19 using the UCSC Lift Genome Annotations tools [31]. Moreover, we removed Y chromosome CNVs due to insufficient data, eliminating all CNVs with missing values.
We removed patients without phenotypes during supplementary data cleansing while using the DECIPHER data. Preprocessing was unnecessary for Fantom5, MGI, and brain genes since all gene coordinates were already in the hg19 format and ready for processing. Furthermore, we removed some genes that were not the result of the model, such as genes that overlapped more with controls than cases or genes that did not overlap with CNVs.

A formal overview of a gene prioritization system
In the context of gene prioritization, the process can be conceptualized as a system where the input consists of a target disease and a comprehensive gene list. Depending on the methodology employed for gene processing, various additional datasets may also be incorporated as auxiliary input. These datasets could include protein networks, pathway data, or reliable candidate genes associated with the target disease, thereby leveraging the "guilt by association" principle. The desired output is a list of candidate genes, which can be sorted or unsorted, representing the outcome of prioritization or classification. Furthermore, a scoring system may be implemented to indicate the likelihood of a gene's association with a particular phenotype or disease. The discriminatory algorithm aims to infer the role of each gene in the development of the target disease.
This section aims to provide a formal definition of our work. Consider a case-control study about a specific target disease. This study comprises copy number variants observed in both patient and healthy control groups. The CNVs can be defined as quadruples, characterized as follows: where ch is the chromosome number, the dosage is the type of CNV, either deletion or duplication, and strt and stp determine the region of the chromosome where the variation occurs. The CNVs are for people (specified by an identifier) whose features (like gender, other phenotypes, etc.) may or may not be available. This CNV is available in two sets: one for cases and one for controls.
Each rare CNV is related to an individual (characterized by p_id), either healthy or patient. Additionally, the dataset may optionally include auxiliary data for each individual, such as gender information. This supplementary information enables us to explore the discriminatory role of genes for each gender. Our objective is to address the gene prioritization problem utilizing a set of rare copy number variants.

The method overview
Compared to conventional machine learning methods, deep learning approaches offer the advantage of constructing a feature hierarchy and reducing data dimensions. This facilitates the identification of hidden patterns within the data more effectively than alternative approaches. An example of deep learning is the autoencoder, which plays a crucial role in dimensionality reduction and generating a concise, high-level representation of the data through a hierarchical arrangement of features [6]. The autoencoder consists of an encoder network (inference network) that progressively transforms the input into a low-dimensional latent representation and a decoder network (generative network) that strives to reconstruct the output to closely resemble the original input. Autoencoders have been widely employed in various bioinformatics problems [76][77][78].
Combining autoencoders with the variational learning framework results in the development of Variational Autoencoders VAE [28,79]. VAEs further enhance the capabilities of autoencoders. Fig 10 presents an overview of the VAE, illustrating its schematic representation.
The primary distinction between autoencoders and their variational counterpart lies in their inherent nature. Autoencoders operate deterministically, while variational autoencoders (VAEs) adopt a probabilistic approach. VAEs, in particular, employ regularization techniques to prevent overfitting during training. VAEs are founded upon the Bayesian theorem and inference principles, incorporating a regularization constraint. This framework assumes that the latent representation follows a multivariate Gaussian distribution, denoted as N(μ, σ).
Numerous studies have demonstrated that VAE exhibits enhanced stability during training and produces less ambiguous outputs than other generative models. This improved performance can be attributed to VAE's optimization of precise objective functions rooted in likelihood principles [81]. The posterior distribution in VAE is modeled as a Gaussian distribution, characterized by its mean and variance. It has been theoretically proven that this Gaussian distribution can approximate any function effectively. The primary objective of the VAE model is to encode the input data into a Gaussian distribution, estimating its mean and covariance.
VAE, a deep generative model that utilizes variational inference, is designed to discover a low-dimensional latent representation, denoted as z, for high-dimensional input data X, following the probability distribution P(X). To capture the intrinsic information of the input dataset, P(z|X), the estimation of the posterior distribution becomes necessary, which is typically intractable. By employing variational inference, a distribution family Q(z|X) (referred to as the variational distribution) is introduced to approximate the P(z|X) distribution. The objective is to minimize the Kullback-Leibler (KL) divergence (D) between these two distributions, serving as a dissimilarity measure.

D½QðzjXÞjjP zjX
After some calculations, we have the following objective function, which is the variational lower bound on log-likelihood: The first term is the expectation over the approximate posterior distribution (named as reconstruction error), and the second term (KL distance) is the regularizer (P (z) is standard Gaussian Distribution, N(0, I)). Q(z|X) is the encoding distribution, and P (X|z) is the decoding distribution.
Utilizing these equations transforms the minimization task into a maximization task. The encoder, denoted as Q(z|X), and the decoder, denoted as P(X|z), play crucial roles in this process. This goal can be achieved using deep neural networks coupled with stochastic gradient variational Bayes. In the VAE framework, the encoder component is employed to generate the parameters of the variational distribution. To mitigate overfitting, the dropout technique can be applied. The recognition model Q(z|X) takes the form of a multi-dimensional Gaussian distribution, where the network generates the mean and covariance of this Gaussian distribution. As for the latent space, a standard Gaussian distribution N(0, I) is employed as the prior distribution.
The loss function in VAE comprises two terms: the reconstruction loss, which facilitates efficient encoding and decoding similar to an autoencoder, and the regularization term, also known as the latent loss, which imposes constraints on the latent space. The regularization term approximates the latent space to follow a standard Gaussian distribution. To incorporate the regularization, the VAE loss function incorporates the Kullback-Leibler divergence, which encourages the covariance matrix to be close to the identity matrix and the mean to be zero.
The training process of the deep learning models consists of two phases: pretraining and fine-tuning. During the pretraining phase, the autoencoder is trained to learn high-level features from all the CNVs associated with the disorders. In the subsequent fine-tuning step, the decoder is set aside, and only the dedicated CNVs specific to the target disease are utilized for training.

The method details
In this section, we explain our method for prioritizing genes. An overview of the method is provided in Fig 11. A deep learning model is proposed for this task. According to the dataset for each disease, we have a copy number of variants for patients and healthy individuals. Each set of copy Fig 11. Overview of the Proposed Method. This figure presents a schematic representation of the entire process involved in the proposed method. The process consists of several sequential steps, as illustrated from left to right. The first step involves data preprocessing for data preparation. Initially, the data is obtained in various formats such as hg18, hg19, etc. To establish uniformity, the data is converted into a unified format, specifically hg19. Additionally, this step eliminates redundant, useless, and incomplete features from the data. In the second step, a model is constructed using the cleaned data. This model takes the form of an autoencoder. Subsequently, the weights of the network are adjusted using the corresponding labels. These labels assign values of zero or one to distinguish between healthy and patient individuals. In the fourth step, the autoencoder's coefficients are utilized to prioritize the genes. The importance of each gene is represented by the size of its corresponding icon in the figure. Larger icons correspond to more significant genes, whereas smaller icons indicate less important genes. https://doi.org/10.1371/journal.pcbi.1011249.g011

PLOS COMPUTATIONAL BIOLOGY
number variants for an individual has some overlaps with genes, which are features that feed into our deep learning. This is shown in Fig 12. We have a list of genes that we want to determine whether their expression will affect disease incidence; besides, we have a list of cases and controls with CNVs for a target disease. We want to convert them to a supervised learning algorithm.
We need to convert CNVs to genes for each healthy and patient individual. Computing overlaps can do this. For the set of genes preprocessed, as discussed before, we measure the length of overlap (in kbps) with the CNVs of an individual. The label of the training set is whether the person is healthy or patient (zero or one).
In the pretraining phase of the model, we used all the CNVs of the brain disorders (autism + schizophrenia + developmental delay). In the next stage, fine-tuning, the CNV of a specific disease is used. Thus, here we have used semi-supervised learning.
After our VAE has been fully trained, we just use the encoder part directly for the next step: 1. Train a VAE using all our data points and transform our data (X) into the latent space (Z variables) (We use all data in this step).
2. Solve a standard supervised learning problem with (Z, Y) pairs (Y is the label set).
The learning algorithm for the whole process is shown in Fig 13. In this algorithm, p is the true posterior, q is the approximate posterior distribution, z is the latent variable, θ is the decoder (z|x) parameters (generative model), and φ is the encoder (x|z) parameters (inference model).
Let's suppose that the encoder weights are represented by W m i�j , where m is the layer number, i is the output size in the last layer, and j is the input size in the current layer (no connection is determined by zero). As we know, the final layer that will be attached to the encoder is the label; and its size is one (whether the individual is patient (= one) or healthy (= zero)). If we multiply all weights matrices together, the result has the size input size × 1 (the matrices are multiplicable since the output of the last layer equals the input of the next layer). The resulting matrix (precisely column vector) can rank genes according to the label (the label is the status of the disease), and this is the same thing we want to model. The formulation is as follows: The specification of the deep learning model is such that a binary classification task is accomplished. The size of each layer is the square root of the number of nodes of the last layer. The final layer has a binary outcome, the last activation function is sigmoid, and loss function is binary cross-entropy, and the optimization algorithm is Adam.
Additionally, we investigate the novelty of the top ten genes in three disorders by conducting a comprehensive literature search (searching the 'gene name' + 'disorder name,' the gene will be labeled as known if a meaningful result is obtained. Table 12 shows the results of this investigation.

The detail of the implementation
The deep learning model has a training phase, which needs a training set including cases and controls. We developed the system with Python and PyTorch [83]. We used cross-validation and grid search to tune the parameters (like the number of neurons in each layer).
The activation functions are empirically selected Rectified Linear Units, and the weights were optimized by an adaptive optimization algorithm (Adam) [84] to minimize reconstruction error and loss. The decoder has a symmetrical structure to the encoder. The learning rate, decay rate, and epoch were set to 0.001 and 1, and at most 10,000, respectively. Also, we restrict connections to some extent for a reduction in parameters. The train/test ratio is set to 80/20. The number of layers is at most three. Since the technique is semisupervised, the first step is to use the data without labels to pretrain the network. The next step is to use the target data to fine-tune it. Next, we use the coefficients of the network to derive a score for each of the features of the input network, i.e., genes. The values of the scores are then sorted so that the relative usefulness of the genes can be evaluated.